Here we show how embeddr (= spectral embedding + principal curves) can be used for pseudotemporal ordering of single-cell gene expression data using the monocle dataset.
library(monocle) ## for monocle data
library(devtools) ## for package development
library(reshape2) ## to melt data frames
library(plyr)
library(dplyr)
library(ggplot2)
library(ggthemes)
library(caret) ## for cross validation steps
library(scater) ## to hold single-cell data
library(knitr) ## for kable function
library(goseq) ## for GO enrichment
library(org.Hs.eg.db) ## for HG19 GO annotations
load_all('.')
data(HSMM)
## add cell_id to HSMM to play nicely with dplyr
phenoData(HSMM)$cell_id <- rownames(pData(HSMM))
First we go through cleaning the monocle dataset and selecting for marker genes only:
marker_genes <- row.names(subset(fData(HSMM),
gene_short_name %in% c("MEF2C", "MEF2D", "MYF5", "ANPEP", "PDGFRA",
"MYOG", "TPM1", "TPM2", "MYH2", "MYH3", "NCAM1",
"CDK1", "CDK2", "CCNB1", "CCNB2", "CCND1", "CCNA")))
x <- log(exprs(HSMM[marker_genes,]) + 1)
x <- t(scale(t(x)))
sce <- fromCellDataSet(HSMM[marker_genes,])
## Defining 'is_exprs' using exprsData and a lower exprs threshold of 0.1
exprs(sce) <- x
The embeddr workflow creates the nearest neighbour graph then returns the laplacian eigenmap. We can do this using embeddr::embeddr with all options available, though embeddr::weighted_graph and embeddr::laplacian_eigenmap are available to perform each step by hand or with custom distance metrics. The default options specify a nearest neighbour graph with \(k = round(log(n))\) neighbours for \(n\) cells. Other options for creating the graph (such as distance measures and heat kernels) are also available.
sce <- embeddr(sce)
plot_embedding(sce)
The function embeddr::plot_embedding can be used at any time on the appropriate data.frame objects and will display all relevant information. We can start by seeing where the different inferred states from the monocle dataset fall on our embedding:
HSMM <- setOrderingFilter(HSMM, marker_genes)
HSMM <- reduceDimension(HSMM, use_irlba = F)
## Reducing to independent components
HSMM <- orderCells(HSMM, num_paths=2, reverse=F)
phenoData(sce)$monocle_state <- pData(HSMM)$State
plot_embedding(sce, color_by = 'monocle_state')
So there appears to be reasonable correspondance between the monocle clusters and natural clusters in our data. The embeddr::cluster_embedding function applies k-means clustering to the dataset:
set.seed(123)
sce <- cluster_embedding(sce, k=3, method='mm')
## Warning: package 'mclust' was built under R version 3.1.2
plot_embedding(sce)
In standard manifold learning problems it is recommended that each feature is appropriately scaled to have mean 0 and variance 1. Since laplacian eigenmaps depend only on the distance between data points, scaling the features is equivalent to treating the variance in each gene equally no matter how large or small it was originally. As a result, the variance in genes that don’t truly vary that much (such as housekeeping genes) is as important as the genes that really do vary with the biological process of interest, adding much noise to the process.
There are several ways to get round this problem. We can choose genes we a priori know vary with the biological process as we have done above (where the genes chosen are those associated with muscle development). An alternative is to choose genes whose coefficient of variation (CV2) is above some threshold or in a particular percentile:
x <- t(log10(exprs(HSMM) + 1))
x_mean <- colMeans(x)
x_var <- apply(x, 2, var)
genes_for_fit <- x_mean > 0.3
CV2 <- x_var[genes_for_fit] / (x_mean[genes_for_fit])^2
df_fit <- data.frame(m = x_mean[genes_for_fit], CV2 = CV2)
fit_loglin <- nls(CV2 ~ a * 10^(-k * m), data = df_fit, start=c(a=5, k=1))
ak <- coef(fit_loglin)
f <- function(x) ak[1] * 10^(-ak[2] * x)
genes_for_embedding <- (CV2 > 4 * predict(fit_loglin))
df_fit$for_embedding <- as.factor(genes_for_embedding)
ggplot(df_fit, aes(x=m, y=CV2, color = for_embedding)) + geom_point() +
theme_bw() + xlab('Mean') + ylab('CV2') + scale_color_fivethirtyeight() +
stat_function(fun=f, color='black')
Once we have picked the genes we want to use for the embedding, we scale them to treat them as equal to avoid high magnitude outliers driving the embedding:
set.seed(123)
sce <- fromCellDataSet(HSMM)
## Defining 'is_exprs' using exprsData and a lower exprs threshold of 0.1
sce@lowerDetectionLimit <- log10(0.1 + 1) ## monocle lower detection limit is 0.1
exprs(sce) <- log10(exprs(sce) + 1)
gene_indices <- match(names(which(genes_for_embedding)), featureNames(sce))
#sce_embedding <- sce[gene_indices,]
sce <- embeddr(sce, genes_for_embedding = gene_indices)
sce_tmp <- sce
phenoData(sce_tmp)$State <- plyr::mapvalues(pData(sce_tmp)$State, from=1:3,
to=c('Differentiating myoblast',
'Proliferating cell',
'Interstitial mesenchymal cell'))
plot_embedding(sce_tmp, color_by = 'State')
We can view the graph of cells to see how connections between different parts form:
plot_graph(sce)
We can also cluster the embedding using kmeans and plot:
sce <- cluster_embedding(sce, 3)
sce_tmp <- sce
phenoData(sce_tmp)$cluster <- plyr::mapvalues(pData(sce_tmp)$cluster, from=c(3, 1, 2),
to=c(1,2,3))
phenoData(sce_tmp)$cluster <- plyr::mapvalues(pData(sce_tmp)$cluster, from=1:3,
to=c('Differentiating myoblast',
'Proliferating cell',
'Interstitial mesenchymal cell'))
plot_embedding(sce_tmp)
Finally, we want to check the method is different to a more primitive technique such as PCA alone:
pca <- princomp(t(exprs(sce[gene_indices,])))
df_pca <- data.frame(x1 = pca$scores[,1], x2 = pca$scores[,2], monocle_state = pData(sce)$State)
ggplot(df_pca, aes(x=x2, y=x1, color=monocle_state)) + geom_point() +
theme_bw() + scale_color_fivethirtyeight()
In the monocle paper they show that groups 1 & 3 correspond to differentiating cells while group 2 is contamination. We can separate off groups 1 & 3, fit the pseudotime trajectories and plot:
sce_23 <- sce[, pData(sce)$cluster %in% c(1,3)]
sce_23 <- fit_pseudotime(sce_23)
## Warning: package 'princurve' was built under R version 3.1.2
#save(sce_23, file='/net/isi-scratch/kieran/embeddr/embeddr/data/sce_23.Rdata')
plot_embedding(sce_23)
We can also compare our pseudotime with that of monocle:
in_state23 <- pData(sce_23)$cell_id
monocle_df <- filter(pData(HSMM), cell_id %in% in_state23)
qplot(arrange(monocle_df, cell_id)$Pseudotime, arrange(pData(sce_23), cell_id)$pseudotime) +
theme_bw() + xlab('Monocle pseudotime') + ylab('embeddr pseudotime')
So there is good correspondence between the monocle and embeddr pseudotimes, though it appears that the embeddr version goes in the wrong direction. Pseudotimes are equivalent up to parity and scaling transformations (which is perhaps more philosophical than it sounds), so we can use the embeddr::reverse_pseudotime function to make the pseudotimes ‘run’ in the same direction:
sce_23 <- reverse_pseudotime(sce_23)
qplot(arrange(monocle_df, cell_id)$Pseudotime, arrange(pData(sce_23), cell_id)$pseudotime) +
theme_bw() + xlab('Monocle pseudotime') + ylab('embeddr pseudotime')
sce_23 <- reverse_pseudotime(sce_23)
The overall correlation between the two pseudotime trajectories is -0.80, which is pretty good.
To plot the genes in pseudotime we need to provide the original gene values for the cells in clusters 1 & 3:
#xp <- select(data.frame(t(x)), one_of(Mp$cell_id))
genes_to_plot <- row.names(subset(fData(HSMM),
gene_short_name %in% c("CDK1", "MEF2C", "MYH3", "MYOG","ID1")))
plot_in_pseudotime(sce_23[genes_to_plot,])
which is largely similar to the monocle equivalent.
We can also plot a heatmap of all the genes in pseudotime:
plot_heatmap(sce_23[gene_indices,])
Let’s have a look at the MRF family of transcription factors:
mrf <- c('MYOD1', 'MYF5', 'MYF6', 'MYOG')
mrf_ind <- sapply(mrf, match, fData(sce_23)$gene_short_name)
plot_in_pseudotime(sce_23[mrf_ind,])
plot_pseudotime_model(sce_23[mrf_ind,])
We can also show that the overall shape of the embedding is robust to a varying choice of nearest neighbours:
nns <- c(5,6,8,10,15,20)
embeddings <- ldply(nns, function(i) {
sce_map <- embeddr(sce, genes_for_embedding = gene_indices, nn = i)
return(dplyr::select(pData(sce_map), component_1, component_2))
})
embeddings$nn <- rep(nns, each = dim(sce)[2])
embeddings$monocle_state <- rep(pData(sce)$State, times = length(nns))
ggplot(data=embeddings, aes(x=component_1, y=component_2, color=monocle_state)) + geom_point() +
theme_bw() + facet_wrap(~ nn) +
ggtitle('Overall shape for variety of nearest neighbours')
In the original monocle dataset, only a small number of cells were assigned to group 3 (‘interstitial mesenchymal cell’). However, our re-analysis suggests it is somewhat a larger with 116 cells in total. These cells were identified as mesenchymal cells as they expressed PDGFRA and SPHK1 in high abundances. We can look to see whether in our cell set we’ve identified more cells that are potentially contamination:
phenoData(sce)$monocle_classification <- mapvalues(pData(sce)$State, from=1:3, c('non-cont','non-cont','cont'))
phenoData(sce)$embeddr_classification <- mapvalues(pData(sce)$cluster, from=1:3, c('non-cont','cont','non-cont'))
print(kable(as.data.frame(table(dplyr::select(pData(sce), monocle_classification, embeddr_classification)))))
| monocle_classification | embeddr_classification | Freq |
|---|---|---|
| non-cont | cont | 57 |
| cont | cont | 59 |
| non-cont | non-cont | 153 |
| cont | non-cont | 2 |
So embeddr calls significatnly more cells as contaminated, while calls none as non-contaminated that monocle calls contaminated.
What we’d like to do is see how the gene markers compare when considering the contamined cells identified by monocle, the contaminated cells we identify, and all other cells:
agree_contamination <- apply(dplyr::select(pData(sce), monocle_classification, embeddr_classification),
1, function(x) {
if(x['monocle_classification'] == 'cont' & x['embeddr_classification'] == 'cont') {
return('Agree contamination')
} else if(x['monocle_classification'] != 'cont' & x['embeddr_classification'] == 'cont') {
return('Embeddr only')
} else if(x['monocle_classification'] == 'cont') {
return('Monocle only')
} else {
return('Agree no contamination')
}
})
cont_genes <- row.names(subset(fData(HSMM), gene_short_name %in% c("PDGFRA", "SPHK1")))
short_names <- fData(HSMM)[cont_genes,]$gene_short_name
y <- exprs(sce[cont_genes,])
#y <- scale(t(y))
y <- data.frame(t(y))
y$agree_contamination <- agree_contamination
y <- melt(y, id.vars='agree_contamination', variable.name='gene', value.name='counts')
y$gene <- mapvalues(y$gene, from = cont_genes, to = short_names)
ggplot(y, aes(x=factor(agree_contamination), color=factor(agree_contamination), y=counts)) +
facet_wrap(~gene) +
geom_boxplot(outlier.shape=NA) + geom_jitter(alpha=0.5) + theme_bw() + xlab('') +
theme(axis.text.x = element_text(angle = -50, hjust=0)) + scale_color_fivethirtyeight(guide=FALSE) +
ylab('log10(FPKM + 1) counts') + ggtitle('Expression of mesenchymal markers')
So it looks like they might have missed a few cells expressing PDGFRA and SPHK1. We can also plot the markers in the embedded space, as per in the original paper:
cont_genes <- row.names(subset(fData(HSMM), gene_short_name %in% c("CDK1", "MYOG", "PDGFRA",
"SPHK1", "MYF5", "NCAM1")))
short_names <- fData(HSMM)[cont_genes,]$gene_short_name
y <- exprs(sce[cont_genes,])
y <- t(y)
y <- data.frame(y)
M_y <- cbind(dplyr::select(pData(sce), component_1, component_2, cluster), y)
M_y <- dplyr::rename(M_y, Cluster = cluster)
M_y_melted <- melt(M_y, id.vars=c('component_1','component_2','Cluster'),
variable.name='gene', value.name='count')
M_y_melted$gene <- mapvalues(M_y_melted$gene, from = cont_genes, to = short_names)
ggplot(M_y_melted, aes(x=component_1, y=component_2, size=count, color=factor(Cluster))) +
geom_point(alpha=0.6) + facet_wrap(~ gene) + theme_bw() +
scale_color_fivethirtyeight(name='Cluster') + scale_size_continuous(name='log10(FPKM)') +
xlab('Component 1') + ylab('Component 2')
Interestingly there appears to be a subset of cells that express both CDK1 and the contamination markers PDGFRA and SPHK1. Let’s look at those we’ve designated as contaminated and plot the expression:
cdk1_ind <- match('CDK1', fData(HSMM)$gene_short_name)
pdgfra_ind <- match('PDGFRA', fData(HSMM)$gene_short_name)
is_contaminated <- pData(sce)$cluster == 2
cx <- exprs(sce[c(cdk1_ind, pdgfra_ind), is_contaminated])
cx <- data.frame(t(cx))
names(cx) <- c('CDK1','PDGFRA')
ggplot(data=cx) + geom_point(aes(x=CDK1, y=PDGFRA)) +
theme_minimal() + xlab('CDK1 expression log10(FPKM + 1)') +
ylab('PDGFRA expression log10(FPKM + 1)') +
ggtitle('Coexpression of PDGFRA and CDK1 in contaminated cells')
Subsample to check robustness:
ncv <- 8
inds <- createFolds(1:ncol(sce), ncv)
embeddings <- ldply(1:ncv, function(i) {
sce_reduced <- sce[gene_indices ,-inds[[i]] ]
sce_reduced <- embeddr(sce_reduced)
M_tmp <- dplyr::select(pData(sce_reduced), component_1, component_2, cluster)
M_tmp$fold <- i
M_tmp
})
embeddings$cluster <- as.factor(embeddings$cluster)
ggplot(embeddings) + geom_point(aes(x=component_1, y=component_2, color=cluster)) +
facet_wrap(~ fold) + theme_bw()
Try the same with 1/3 of the data:
ncv <- 3
inds_3 <- createFolds(1:ncol(sce), k=ncv)
embeddings_3 <- ldply(1:ncv, function(i) {
sce_reduced <- sce[gene_indices ,-inds_3[[i]] ]
sce_reduced <- embeddr(sce_reduced, nn = 5)
M_tmp <- dplyr::select(pData(sce_reduced), component_1, component_2, cluster)
M_tmp$fold <- i
M_tmp
})
embeddings_3$cluster <- as.factor(embeddings_3$cluster)
ggplot(embeddings_3) + geom_point(aes(x=component_1, y=component_2, color=cluster)) +
facet_wrap(~ fold) + theme_bw()
And see that the pseudotime fits are roughly constant:
save(sce, gene_indices, file='~/oxford/embeddr//data/saved.Rdata')
set.seed(45)
folds <- createMultiFolds(1:ncol(sce), k = 5, times = 30)
all_embeddings <- lapply(folds, function(ind) {
#sce_tmp <- sce[gene_indices, ind]
sce_tmp <- embeddr(sce[gene_indices,ind])#, genes_for_embedding = gene_indices)
sce_tmp
})
#all_embeddings_unlist <- unlist(all_embeddings, recursive=FALSE)
psts <- lapply(all_embeddings, function(sce_tmp) {
sce_tmp_13 <- sce_tmp[,pData(sce_tmp)$cluster %in% c(1,3)]
sce_tmp_13 <- fit_pseudotime(sce_tmp_13 )
sce_tmp_13$pseudotime
})
corrs <- sapply(1:length(all_embeddings), function(i) {
sce_i <- all_embeddings[[i]]
cells_in_i <- pData(sce_i)$cell_id
sce_23_red <- sce_23[, pData(sce_23)$cell_id %in% cells_in_i]
abs(cor(psts[[i]], sce_23_red$pseudotime))
})
qplot(corrs) + theme_minimal() +
ggtitle('Correlation between sampled pseudotimes') +
xlab('Correlation')
Model a single gene in pseudotime:
cdk1 <- row.names(subset(fData(HSMM), gene_short_name == 'CDK1'))
sce_23@lowerDetectionLimit <- log10(1 + 0.1)
model <- fit_pseudotime_model(sce_23, cdk1)
## Warning: package 'AER' was built under R version 3.1.2
## Warning: package 'car' was built under R version 3.1.2
null_model <- fit_null_model(sce_23, cdk1)
p_val <- compare_models(model, null_model)
plot_pseudotime_model(sce_23[cdk1, ], model)
## Warning in if (class(models) == "list") {: the condition has length > 1 and
## only the first element will be used
Now try all of them:
n_cells_expressed <- rowSums(is_exprs(sce_23))
keep_gene <- n_cells_expressed > 0.2 * dim(sce_23)[2]
sce_23_kept <- sce_23[keep_gene,]
## can optionally use pre-computed models in the pseudotime test function.
## this is really handy as we can keep these around for predicting and plotting
## without having to compute them every time
diff_gene_test <- pseudotime_test(sce_23_kept, n_cores = 1)
And plot p-vals:
qplot(diff_gene_test$p_val, binwidth = 0.01) + theme_bw() + xlab('p-value')
qplot(diff_gene_test$q_val, binwidth = 0.01) + theme_bw() + xlab('corrected p-value')
alpha <- 0.01
sig_genes <- diff_gene_test$gene[diff_gene_test$q_val < alpha]
sce_sig <- sce_23_kept[sig_genes,]
Use the predicted expression from the fitted models to cluster genes by pseudotemporal expression:
Now we can calulate the predicted expression matrix:
## let's get the predicted expression matrix from the models
pe <- predicted_expression(sce_sig)
And we can plot the correlation plot:
pcor <- cor(pe)
dist_cor <- 1 - pcor / 2
hc <- hclust(as.dist(dist_cor))
plot(hc, labels=FALSE)
Now look at conserved modules:
n_cuts <- 6
gene_classes <- cutree(hc, n_cuts)
df_gene <- data.frame(gene=colnames(pe), class=gene_classes)
pe <- data.frame(scale(pe)) ## scaled pst-by-gene
pe$pseudotime <- pseudotime(sce_sig)
## save(pe, df_gene, file='~/pe.Rdata')
pe_melted <- melt(pe, id.vars='pseudotime', value.name='expression', variable.name='gene')
pe_melted <- inner_join(pe_melted, df_gene)
## we want to plot the mean expression rather than all of it (messy)
gp_groups <- group_by(pe_melted, class, pseudotime)
mean_expr_per_group <- dplyr::summarise(gp_groups, mean_expr = mean(expression))
pe_melted <- inner_join(pe_melted, mean_expr_per_group)
## pe_melted <- arrange(pe_melted, gene, pseudotime)
ggplot(pe_melted) + geom_line(aes(x=pseudotime, y=mean_expr), color='red') +
facet_wrap(~ class, ncol = 1) + stat_density2d(aes(x=pseudotime, y=expression), n=150) +
theme_bw() + ylab('Expression')
Number of genes in each class:
genes_per_class <- sapply(1:n_cuts, function(i) sum(gene_classes == i))
df_gpc <- data.frame(Class=1:n_cuts, 'Number in class' = genes_per_class)
print(kable(df_gpc, caption='Genes per class'))
| Class | Number.in.class |
|---|---|
| 1 | 363 |
| 2 | 206 |
| 3 | 529 |
| 4 | 78 |
| 5 | 546 |
| 6 | 123 |
# names(gene_classes) <- sapply(as.character(names(gene_classes)), function(gn) strsplit(gn, '[.]')[[1]][1])
#
# base_name <- "/net/isi-scratch/kieran/embeddr/embeddr/data/"
# for(i in 1:n_cuts) {
# filename <- paste0(base_name, 'genes_', i,'.txt')
# conn <- file(filename)
# gn <- names(which(gene_classes == i))
#
# writeLines(gn, conn)
# close(conn)
# }
Now look at enriched GO terms for each class:
genes_per_class <- sapply(1:n_cuts, function(i) 1 * (df_gene$class == i))
gene_names <- df_gene$gene
all_genes <- featureNames(sce_23_kept)
gene_names <- sapply(as.character(gene_names), function(gn) strsplit(gn, '[.]')[[1]][1])
all_genes <- sapply(as.character(all_genes), function(gn) strsplit(gn, '[.]')[[1]][1])
genes_not_de <- setdiff(all_genes, gene_names)
genes_per_class <- rbind(genes_per_class, matrix(0, ncol=ncol(genes_per_class), nrow=length(genes_not_de)))
rownames(genes_per_class) <- c(gene_names, genes_not_de)
enriched_terms <- apply(genes_per_class, 2, function(gene_set) {
pwf <- nullp(gene_set, 'hg19', 'ensGene', plot.fit=FALSE)
go <- goseq(pwf, 'hg19', 'ensGene', test.cats=c('GO:BP'))
go$log_qval <- log10(p.adjust(go$over_represented_pvalue, method='BH'))
go <- dplyr::filter(go, log_qval < log10(0.01))
go <- dplyr::select(go, category, log_qval, term)
names(go) <- c('Category','log10 q-value','Term')
return(go)
})
reduced <- lapply(enriched_terms, head, 6)
Now print the terms:
for(i in 1:n_cuts) {
print(kable(reduced[[i]], caption = paste('GO terms for cluster', i)))
}
| Category | log10 q-value | Term |
|---|---|---|
| GO:0000278 | -48.34807 | mitotic cell cycle |
| GO:0007049 | -41.35798 | cell cycle |
| GO:1903047 | -39.28129 | mitotic cell cycle process |
| GO:0022402 | -37.59519 | cell cycle process |
| GO:0000280 | -37.47330 | nuclear division |
| GO:0048285 | -35.44262 | organelle fission |
| Category | log10 q-value | Term |
|---|---|---|
| GO:0006415 | -11.82966 | translational termination |
| GO:0006613 | -11.82966 | cotranslational protein targeting to membrane |
| GO:0045047 | -11.82966 | protein targeting to ER |
| GO:0072599 | -11.82966 | establishment of protein localization to endoplasmic reticulum |
| GO:0006414 | -11.39396 | translational elongation |
| GO:0006614 | -11.09383 | SRP-dependent cotranslational protein targeting to membrane |
| Category | log10 q-value | Term |
|---|---|---|
| GO:0030198 | -9.223980 | extracellular matrix organization |
| GO:0043062 | -9.223980 | extracellular structure organization |
| GO:0032502 | -8.653863 | developmental process |
| GO:0007275 | -8.316083 | multicellular organismal development |
| GO:0044767 | -8.311734 | single-organism developmental process |
| GO:0009888 | -8.164589 | tissue development |
Table: GO terms for cluster 4
Category log10 q-value Term ——— ————– —–
| Category | log10 q-value | Term |
|---|---|---|
| GO:0061061 | -18.08665 | muscle structure development |
| GO:0003012 | -17.39791 | muscle system process |
| GO:0006936 | -16.64856 | muscle contraction |
| GO:0030049 | -16.64856 | muscle filament sliding |
| GO:0033275 | -16.64856 | actin-myosin filament sliding |
| GO:0007517 | -16.63826 | muscle organ development |
| Category | log10 q-value | Term |
|---|---|---|
| GO:0006457 | -8.735674 | protein folding |
| GO:0051084 | -5.922023 | ‘de novo’ posttranslational protein folding |
| GO:0006458 | -5.616627 | ‘de novo’ protein folding |
| GO:0006986 | -3.392902 | response to unfolded protein |
| GO:0035966 | -3.147552 | response to topologically incorrect protein |